We will compare some classifiers on the “Toxic” column.
Load libraries
library(tidyverse)
package ‘tidyverse’ was built under R version 4.0.5
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ---------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.1 --
v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.3 v dplyr 1.0.7
v tidyr 1.1.3 v stringr 1.4.0
v readr 2.0.1 v forcats 0.5.1
package ‘ggplot2’ was built under R version 4.0.5
package ‘tibble’ was built under R version 4.0.5
package ‘tidyr’ was built under R version 4.0.5
package ‘purrr’ was built under R version 4.0.5
package ‘dplyr’ was built under R version 4.0.5
package ‘stringr’ was built under R version 4.0.5
package ‘forcats’ was built under R version 4.0.5
-- Conflicts ------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(tictoc)
package ‘tictoc’ was built under R version 4.0.5
library(caret)
package ‘caret’ was built under R version 4.0.5
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Attaching package: ‘caret’
The following object is masked from ‘package:purrr’:
lift
library(e1071)
package ‘e1071’ was built under R version 4.0.5
source("./parameters.R")
# We open a relatively small Bag of Words in order to limit calculation time
fileName = "bow_tfidf__min_words_100_2grams_1000__sampling_balanced__cor_cut_1_from_1408_to_1408.csv"
df = read_csv(fileName, col_types=col_types_df)
df = df[,-c(2,3,5:9)]
df
# Column 1 holds the split flag (1 = train, 2 = test); it is dropped after the split
df_train = df[df[[1]] == 1, -1]
df_test = df[df[[1]] == 2, -1]
# Let's check the label balance
writeLines("Toxic labels in the train set:")
Toxic labels in the train set:
table(df_train$df_toxic)
0 1
14688 14634
writeLines("\nToxic labels in the test set:")
Toxic labels in the test set:
table(df_test$df_toxic)
0 1
5783 5841
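The counts above can also be read as proportions, which makes the balance easier to judge: both sets are within half a percentage point of a 50/50 split. A minimal sketch, using the counts copied by hand from the output above rather than the original data frame; the helper name `class_share` is hypothetical.

```r
# Class counts copied from the table() output above
train_counts <- c(`0` = 14688, `1` = 14634)
test_counts  <- c(`0` = 5783,  `1` = 5841)

# Hypothetical helper: share of each class, rounded to 4 decimals
class_share <- function(counts) round(counts / sum(counts), 4)

train_share <- class_share(train_counts)
test_share  <- class_share(test_counts)

print(train_share)
print(test_share)
```

On the real data frames, `prop.table(table(df_train$df_toxic))` would give the same kind of result directly.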
tic("Training: ")
f <- glm(df_toxic ~ ., data=df_train, family = 'binomial')
glm.fit: algorithm did not converge
glm.fit: fitted probabilities numerically 0 or 1 occurred
toc(log = TRUE)
Training: : 1392.45 sec elapsed
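The two warnings above are typical of (quasi-)complete separation, which is plausible with more than a thousand sparse predictors, although this notebook does not verify that this is the cause here. The sketch below reproduces the symptom on a tiny data set invented for illustration: one predictor that separates the two classes perfectly. It collects the warnings instead of printing them and shows that the slope grows very large.

```r
# Toy data (invented): x <= 5 is always class 0, x >= 6 is always class 1
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Collect warning messages raised while fitting, without printing them
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

slope <- unname(coef(fit)["x"])

print(msgs)           # warnings emitted by glm.fit
print(fit$converged)  # convergence flag stored in the fitted object
print(slope)          # very large slope: the likelihood has no finite maximum
```

The same two checks, `f$converged` and the magnitude of `coef(f)`, can be applied to the model fitted above.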
Training takes about 23 minutes (1392 s), and glm warns that the fit did not converge.
What’s in f?
f
Call: glm(formula = df_toxic ~ ., family = "binomial", data = df_train)
Coefficients:
(Intercept) abil abl absolut abus academ accept access accord account accur
9.929e+14 3.667e+13 -1.499e+14 3.109e+14 9.752e+13 1.236e+14 1.850e+13 -4.277e+14 -1.115e+14 -1.754e+14 1.900e+14
accus achiev across act action activ actual add addit address admin
-1.364e+13 -4.381e+13 3.501e+14 1.066e+14 5.092e+13 -4.083e+13 -8.125e+13 -2.410e+14 -9.720e+13 -8.493e+13 4.408e+13
administr admit advanc advertis advic afd afraid age agenda ago agre
2.197e+14 -7.874e+13 1.098e+14 -2.870e+14 -1.172e+14 -2.259e+14 -1.537e+12 -4.216e+13 -5.413e+12 2.512e+14 -4.406e+14
ahead aid air album alleg allow almost alon along alreadi also
6.499e+13 5.739e+12 -4.825e+14 -6.298e+13 3.467e+13 -7.176e+12 1.521e+14 1.737e+14 -2.767e+14 -6.792e+13 2.550e+13
alter altern although alway american among amount anal and angri ani
-7.277e+13 -1.150e+14 -7.025e+12 -1.702e+14 -1.325e+13 2.142e+14 2.700e+14 -9.836e+12 -4.307e+14 4.610e+13 -4.851e+14
anim annoy anonym anoth answer anti anybodi anymor anyon anyth anyway
1.146e+14 1.710e+14 -1.575e+14 -1.315e+14 -9.120e+13 1.316e+14 -1.136e+14 -4.354e+13 -1.261e+14 -1.250e+13 -1.442e+14
anywher apolog appar appeal appear appli applic appreci approach appropri arab
-1.760e+14 -2.919e+14 -7.762e+12 -2.650e+14 -1.395e+14 2.505e+14 2.446e+14 -4.954e+14 -5.150e+14 2.961e+13 4.601e+12
arbitr archiv area argu argument around arrog ars art articl artist
-1.736e+14 -1.615e+14 -8.623e+13 -2.296e+14 5.012e+13 2.907e+13 4.241e+14 1.446e+14 -2.386e+14 -9.802e+13 6.419e+13
ask aspect ass assert asshol assist associ assum attack attempt attent
-1.384e+14 4.071e+13 8.344e+12 1.771e+14 2.236e+14 -5.662e+14 -9.238e+13 -2.329e+14 1.438e+14 2.383e+14 1.253e+14
attitud attribut author automat avail avoid awar award away babi back
4.473e+14 -7.398e+14 7.316e+13 7.640e+12 1.263e+14 -7.763e+13 -8.661e+13 1.039e+14 1.701e+14 1.653e+14 1.651e+14
background bad bag balanc ball ban band barnstar base basi basic
-3.490e+12 4.091e+13 -7.973e+13 -2.958e+14 6.891e+12 1.542e+14 -5.184e+13 -3.379e+14 -2.052e+14 -1.443e+14 -1.418e+13
bastard beat becam becom begin behavior behaviour behind belief believ belong
2.981e+14 -7.093e+13 -2.433e+14 1.204e+14 8.595e+13 -1.861e+13 -1.355e+14 9.032e+13 -1.098e+14 -1.835e+14 1.917e+14
benefit besid best bet better beyond bias big biggest biographi bit
2.423e+14 -4.460e+14 -1.765e+14 -2.613e+13 -7.620e+13 -1.350e+14 1.032e+14 2.290e+14 1.618e+14 1.534e+14 2.034e+13
bitch black blame blank blatant blind block blog blood bloodi blow
1.012e+13 2.826e+14 1.693e+14 -3.299e+14 2.046e+14 -6.879e+13 -7.421e+12 -1.477e+14 2.747e+14 7.248e+13 2.831e+13
board bodi book border bore born bot bother bottom box boy
-1.250e+14 -2.705e+14 -2.259e+14 -5.112e+14 -4.694e+13 -6.259e+13 -1.537e+14 2.495e+14 -3.105e+14 -2.393e+14 9.630e+13
brain `break` bring british brother brought buddi build bulli bullshit bunch
1.958e+14 6.450e+13 -7.554e+13 -9.656e+12 1.003e+14 -2.378e+14 4.189e+14 -4.802e+14 3.849e+13 1.581e+13 1.591e+14
burn busi butt call came campaign can cant capit care carri
-3.286e+13 -1.152e+13 6.253e+14 7.549e+13 1.303e+14 3.284e+12 -6.741e+13 7.545e+13 -9.965e+13 -3.940e+13 -9.724e+13
case categori caus censor cent central centuri certain challeng chanc chang
-1.704e+14 -3.451e+14 5.963e+13 3.631e+13 -1.250e+13 -1.343e+15 -1.767e+14 -3.224e+14 1.430e+14 -3.990e+14 -9.158e+13
charact charg check cheer child children chines choic choos christian citat
-1.230e+14 2.096e+14 -2.893e+14 -7.980e+14 3.462e+14 2.902e+14 2.782e+13 2.848e+14 -1.827e+14 -8.971e+13 9.295e+12
cite citi civil claim clarifi class clean clear click close club
6.722e+13 -6.031e+13 -5.614e+13 -1.130e+14 -3.405e+14 -3.345e+14 1.495e+14 -2.066e+13 -1.622e+14 1.325e+14 -2.115e+14
cock cocksuck code collect colleg color com come comment commit common
4.458e+13 -2.602e+14 -1.501e+14 -1.406e+14 1.734e+13 8.577e+13 1.057e+14 9.889e+13 -1.182e+13 1.313e+14 -3.296e+14
communist communiti compani compar complain complaint complet comput concept concern conclus
2.693e+14 -3.360e+14 -1.583e+14 -1.032e+13 1.775e+14 2.470e+14 2.097e+14 -5.785e+13 -1.844e+14 -9.974e+13 1.002e+14
conduct confirm conflict confus connect consensus consid consider consist constant constitut
-2.003e+14 -7.000e+13 -1.059e+14 -2.233e+14 -5.029e+14 -2.231e+14 -2.040e+14 -4.882e+13 -2.666e+14 1.506e+14 1.470e+14
construct contact contain content contest context continu contrib contribut contributor control
-1.809e+14 -2.386e+14 1.335e+14 -4.133e+13 -2.593e+14 -2.496e+14 -2.839e+14 -4.117e+14 5.978e+13 1.426e+14 -6.380e+13
controversi convent convers convinc cool copi copyright correct corrupt count countri
5.655e+13 -4.227e+14 1.416e+14 -4.964e+14 -3.484e+14 -1.589e+14 1.262e+14 -2.765e+14 3.629e+14 -1.254e+14 3.728e+13
coupl cours court cover coward crap crazi creat creation credibl credit
-5.048e+14 -1.635e+14 -3.763e+14 5.868e+13 1.011e+14 3.154e+14 3.760e+13 -5.520e+13 -2.083e+14 -1.327e+14 -2.459e+14
cri crime crimin criteria critic cross cultur cunt current cut damag
-1.223e+14 -1.988e+14 -1.692e+14 2.645e+14 -2.353e+13 -4.884e+13 -1.559e+14 3.872e+13 6.810e+13 1.881e+14 -2.754e+13
damn dare data date day dead deal death debat decemb decid
2.056e+13 1.561e+14 4.671e+13 -2.634e+14 3.714e+13 9.605e+13 -1.255e+14 1.693e+14 -7.991e+13 -5.658e+14 4.668e+13
decis defend defin definit degre delet demand demonstr deni describ descript
-5.666e+14 7.525e+13 -5.414e+13 -3.827e+14 -3.539e+12 4.393e+13 2.505e+14 -4.786e+14 -4.412e+13 -1.070e+14 -3.219e+14
deserv design despit destroy detail determin develop dick dickhead didnt die
-1.003e+14 -2.008e+14 8.475e+13 3.554e+14 -3.579e+14 -4.407e+14 3.251e+13 1.573e+14 1.503e+13 -1.384e+14 7.531e+12
diff differ difficult direct dirti disagre discuss disgust display disput disrupt
5.482e+14 -5.913e+13 -1.226e+14 -5.146e+13 2.369e+14 -1.359e+13 -5.598e+13 2.728e+14 -1.678e+14 -3.687e+14 -8.798e+12
distinct document dog dollar done dont doubl doubt douch drive drop
-4.420e+14 2.850e+12 -6.900e+13 1.772e+14 -6.002e+13 7.059e+13 -1.302e+14 -2.475e+14 4.640e+14 7.857e+13 -6.693e+13
dude due dumb dumbass earli earlier earth easi easier easili eat
-1.623e+14 1.008e+14 4.266e+14 3.869e+14 -5.245e+14 -6.618e+13 1.357e+13 2.584e+14 -3.858e+14 -1.487e+14 8.729e+13
ect edit editor educ effect effort either elect els elsewher email
-1.248e+14 -5.380e+13 3.080e+13 6.106e+12 2.464e+13 2.546e+14 -1.704e+14 2.331e+14 8.802e+13 3.590e+14 -3.941e+14
encourag encycloped encyclopedia end engag engin enjoy enough enter entir entri
-2.789e+14 1.465e+14 3.293e+14 -4.219e+13 -7.774e+13 -3.175e+14 1.728e+14 1.075e+14 -2.229e+14 1.521e+14 6.497e+13
episod equal eras erienc eriment error ert especi establish etc ethnic
-1.237e+14 4.319e+13 3.049e+14 -1.890e+14 -2.252e+14 -1.217e+14 3.451e+13 3.020e+14 -4.662e+14 -1.450e+13 -1.796e+13
european even event eventu ever everi everybodi everyon everyth evid evil
-2.156e+14 7.956e+13 -1.028e+14 -2.689e+14 1.363e+14 1.704e+14 1.837e+14 -5.817e+13 -3.828e+13 -3.240e+12 1.806e+14
exact exampl except excus exist extens extern extrem eye face fact
-1.838e+14 -2.241e+14 -5.975e+13 4.925e+14 -1.156e+14 8.709e+12 3.364e+14 1.238e+14 3.263e+14 2.386e+14 7.246e+13
factual fag faggot fail fair faith fake fall fals famili familiar
3.865e+14 1.126e+13 1.959e+13 4.257e+14 2.135e+13 -7.478e+13 1.317e+14 1.607e+12 1.102e+14 -2.752e+13 -8.590e+14
famous fan far fascist fat father favor featur februari feel fellow
-1.096e+14 4.319e+13 -1.661e+14 3.674e+14 -8.275e+12 3.806e+14 3.428e+14 1.676e+14 -7.999e+14 -5.616e+12 1.299e+14
felt femal field fight figur file fill film final find fine
2.906e+14 -1.750e+14 -4.983e+14 -9.900e+13 -2.990e+14 -8.721e+13 -2.069e+14 -1.978e+13 1.396e+14 -2.109e+14 -3.354e+14
finish fire first fit five fix flag fli focus folk follow
-1.297e+14 -1.708e+14 -1.031e+14 2.862e+14 -1.157e+14 -2.339e+14 -2.768e+14 2.242e+14 -4.671e+14 2.234e+13 9.161e+13
fool footbal forc forev forget forgot form formal format former forum
-7.101e+11 -2.531e+14 1.444e+14 5.630e+14 -6.883e+12 -5.486e+14 -2.312e+14 -6.287e+13 -2.121e+14 -2.783e+13 1.990e+14
forward found four frank freak free freedom frequent friend front frown
-4.491e+14 -6.861e+13 -4.133e+14 2.333e+14 3.449e+14 -8.558e+13 -1.790e+14 -5.774e+14 -1.219e+13 5.158e+14 -2.016e+14
fuck fucker fuckin full fulli fun `function` funni furthermor futur game
8.543e+13 1.053e+14 2.228e+14 4.890e+13 3.020e+13 2.075e+14 -5.850e+13 5.336e+13 1.126e+14 -3.184e+14 -4.381e+13
garbag gave gay general get girl give given glad god goe
4.515e+14 1.392e+14 5.433e+13 -1.015e+14 1.107e+14 8.783e+13 1.583e+14 -1.972e+13 -2.899e+14 1.912e+14 2.191e+13
gone gonna good googl got govern great greek ground group grow
2.685e+14 3.917e+14 -2.200e+13 1.220e+14 1.357e+12 -1.324e+14 -3.079e+14 1.087e+14 -8.869e+13 -1.843e+14 1.449e+14
guess guid guidelin gun guy haha half hand handl happen happi
4.829e+13 4.173e+13 1.103e+14 -1.946e+14 1.582e+14 3.477e+14 8.921e+13 2.271e+14 -6.688e+12 1.087e+13 -2.331e+13
harass hard hate havent head hear heard heart held hell hello
-9.934e+13 -1.764e+13 1.816e+13 -1.507e+14 1.701e+14 -1.384e+14 -2.579e+13 -2.935e+13 2.071e+14 2.203e+14 9.381e+13
help henc here hesit hey hide high higher histor histori hit
-9.717e+13 2.931e+13 3.112e+14 2.245e+14 3.794e+14 -3.265e+14 2.538e+14 1.334e+14 -1.324e+14 -1.066e+13 7.648e+12
hitler hold hole home homo homosexu honest hope horribl horror hot
1.659e+13 -9.417e+13 -3.076e+13 4.918e+13 -1.371e+14 3.982e+14 -1.818e+14 -6.891e+12 8.251e+14 -3.672e+12 5.130e+14
hour hous howev http huge human hundr hurt hypocrit idea ident
8.416e+13 -1.915e+13 -1.864e+14 -4.598e+14 -1.867e+13 1.341e+14 2.801e+14 1.301e+14 1.212e+14 -1.693e+14 -1.443e+14
identifi idiot ignor ill imag imagin immedi impli import impress improv
-2.713e+14 2.408e+14 2.935e+14 1.252e+14 -1.693e+13 1.349e+14 2.263e+14 -3.554e+14 -1.645e+14 3.176e+14 -2.912e+13
inappropri incid includ inclus incorrect increas inde independ indian indic individu
1.613e+14 -4.150e+14 -1.715e+14 -1.691e+14 -1.871e+14 -2.715e+14 -1.391e+14 -3.798e+14 -1.424e+14 1.937e+14 2.031e+14
info infobox inform initi input insert insist instanc instead insult intellig
-1.441e+14 -3.766e+14 -8.738e+13 -1.112e+14 -5.393e+14 2.974e+14 2.974e+14 8.821e+13 1.729e+14 2.913e+14 4.673e+14
intend intent interest intern internet interpret introduc introduct investig invit involv
-1.528e+14 -1.768e+14 -3.669e+14 -2.185e+14 8.399e+13 1.096e+14 -5.229e+14 -3.951e+14 5.978e+13 -6.077e+14 -3.398e+14
irrelev issu item jerk jew jewish jimbo job join joke journal
-2.091e+14 -5.464e+13 -3.500e+14 8.151e+14 1.205e+13 9.491e+12 1.433e+14 1.789e+14 2.482e+14 3.194e+14 -4.810e+14
jpg judg jump just justifi keep kept kick kid kill kind
-6.599e+13 -9.059e+13 3.649e+14 9.223e+13 6.479e+12 -1.322e+13 -4.405e+14 6.941e+13 4.591e+13 4.981e+13 -7.233e+12
king kiss knew know knowledg known label lack lain lanat land
1.022e+14 -7.567e+13 1.123e+14 -6.017e+13 -4.248e+14 -6.366e+13 -8.045e+13 1.215e+14 -9.791e+13 -1.910e+14 1.154e+14
languag larg last late later laugh law lazi lead learn least
-9.206e+13 2.598e+14 2.794e+14 2.932e+13 -3.162e+13 9.439e+13 -7.962e+13 7.934e+14 -2.103e+14 -1.361e+14 -5.727e+13
leav left legal legitim less let level liar liber licens licit
1.487e+14 -1.403e+14 2.530e+14 2.057e+14 2.559e+14 -8.506e+13 -6.958e+13 5.713e+14 1.705e+14 6.273e+13 -6.899e+13
lick lie life light like limit line link list listen liter
9.549e+13 2.402e+14 1.381e+14 -3.264e+14 1.347e+14 4.957e+13 1.719e+13 -9.011e+13 -2.573e+14 1.828e+14 5.112e+14
littl live load local locat lock log logic long longer look
1.360e+14 1.219e+14 2.956e+14 2.043e+14 4.371e+13 3.104e+13 9.094e+13 -5.730e+13 -1.604e+14 6.900e+13 -1.042e+14
lose loser lost lot loud love low luck mad made magazin
8.778e+13 2.942e+13 1.403e+14 8.415e+13 1.187e+14 2.039e+13 1.278e+14 6.187e+13 2.534e+14 -1.098e+14 4.315e+14
mail main maintain major make male man manag mani manner map
1.239e+14 -4.194e+13 -2.482e+14 -8.787e+13 -6.005e+13 9.531e+13 1.563e+14 -3.353e+14 -6.800e+13 1.864e+14 -5.196e+14
mark mass massiv master match mate materi matter may mayb mean
-9.910e+13 3.576e+14 1.271e+14 2.531e+14 6.876e+13 2.658e+14 7.816e+13 1.495e+14 -9.070e+13 -4.858e+12 -3.679e+13
meant media meet member men mental mention mere merg mess messag
-2.456e+14 2.626e+14 -1.068e+14 1.360e+14 -3.693e+13 1.511e+14 -1.725e+14 3.825e+14 -5.005e+14 9.505e+13 -1.288e+14
met method middl might militari million mind mine minor minut mislead
1.358e+14 1.355e+14 -5.027e+13 -1.818e+14 1.361e+14 -2.050e+14 2.227e+14 -2.030e+14 -9.649e+13 2.613e+13 -3.194e+14
miss mistak modern mom moment money monkey month moron most mother
-3.101e+14 -2.570e+14 -1.686e+14 -4.300e+13 -3.545e+14 -1.088e+14 1.383e+14 -2.850e+14 1.573e+13 2.898e+14 1.570e+14
motherfuck motiv mouth move movement movi much multipl murder music muslim
5.400e+14 2.751e+14 2.730e+14 -2.735e+14 6.749e+13 9.037e+12 5.267e+12 4.832e+13 6.527e+12 -3.371e+14 3.292e+13
must name nation nationalist natur nazi near necessari need negat neither
2.825e+13 -2.234e+13 3.992e+13 -2.443e+14 -1.544e+14 4.281e+14 1.746e+13 -4.222e+14 -1.045e+14 1.225e+13 9.651e+13
nerd net neutral never new news newspap `next` nice nigga nigger
1.828e+14 -1.733e+14 -2.786e+14 6.166e+13 -1.691e+12 2.414e+14 5.181e+14 5.735e+13 2.578e+13 -6.458e+13 1.748e+12
night nobodi nomin non none nonsens normal notabl note noth notic
1.314e+14 2.745e+13 -2.904e+14 1.823e+13 8.225e+13 2.568e+14 -4.560e+12 -1.853e+14 -1.274e+14 1.961e+14 -1.375e+14
novemb now npov number numer object observ obvious occur octob odd
-3.905e+14 7.552e+13 4.396e+13 -1.023e+13 -6.511e+14 -3.400e+11 -4.076e+14 -1.583e+13 -7.316e+14 -4.278e+12 -2.667e+14
offend offens offer offic offici often okay old one onlin open
1.321e+14 1.707e+14 -5.234e+13 -8.468e+13 4.836e+12 -5.787e+13 -1.492e+14 2.868e+13 -3.381e+13 -2.394e+14 6.071e+13
oper opinion oppos opposit order org organ origin other otherwis outsid
-3.518e+14 -6.441e+13 -2.638e+14 -3.375e+14 -1.271e+14 3.402e+14 3.209e+14 -9.124e+13 -6.053e+13 1.712e+14 -7.063e+14
own page paid pain paper paragraph parent part parti particip particular
3.846e+14 -1.835e+13 1.085e+14 1.802e+14 2.353e+14 -1.481e+14 -2.793e+14 -1.142e+14 -3.293e+13 -7.807e+13 -3.546e+14
pass past pathet pay peni peopl per percent perfect perform perhap
-1.188e+14 5.357e+13 4.046e+14 2.774e+14 1.209e+14 1.518e+14 -5.800e+14 7.662e+13 2.222e+14 9.017e+13 -3.392e+14
period person photo phrase physic pick pictur piec pig pillar piss
-3.563e+12 -1.653e+14 -3.855e+14 -9.798e+13 7.160e+13 -1.138e+14 -5.089e+13 1.400e+14 3.501e+11 3.261e+14 4.534e+14
place plain plan play player pleas plenti plus point polic polici
9.547e+13 2.088e+14 -1.999e+14 2.037e+14 -3.063e+14 -2.382e+14 -4.746e+14 1.377e+14 -1.747e+14 -1.735e+14 -5.423e+13
polit poor pop popul popular porn posit possibl post potenti power
-8.235e+12 3.510e+14 1.386e+14 -1.120e+14 -3.518e+14 4.336e+14 -2.146e+13 -1.450e+14 2.965e+13 8.090e+13 -2.254e+14
practic preced prefer present presid press pretend pretti prevent previous prick
-1.168e+14 -2.090e+14 -2.572e+14 -4.047e+14 -1.945e+14 -1.999e+14 1.576e+14 2.046e+13 1.099e+14 7.554e+13 3.594e+14
primari prior privat pro probabl problem process produc product profession program
2.546e+13 4.304e+13 -1.879e+14 -4.994e+14 -1.406e+13 -6.431e+13 -1.802e+14 -3.333e+14 8.485e+13 1.043e+14 -9.512e+13
progress project promot proof propaganda proper propos protect prove provid public
-2.732e+14 4.875e+12 2.391e+13 1.672e+14 1.820e+14 2.503e+14 -2.161e+14 -7.125e+13 -9.342e+13 -6.370e+13 -1.195e+14
publish pull punish punk puppet pure purpos push pussi put qualifi
2.063e+13 -1.623e+14 4.549e+13 1.302e+14 3.905e+14 5.165e+14 1.286e+14 -2.067e+14 -8.167e+13 1.206e+14 -1.732e+14
qualiti question quick quit quot race racist rais random rape rate
-2.732e+14 -1.597e+14 -4.588e+14 1.942e+14 -1.787e+14 -1.409e+14 1.274e+14 6.720e+13 -3.074e+14 3.090e+13 -1.215e+14
rather rational reach read reader real realiti realiz realli reason receiv
-2.879e+14 8.658e+13 -6.878e+14 -2.302e+13 -2.484e+14 1.159e+14 3.295e+14 -2.005e+14 7.644e+13 -2.594e+13 -3.236e+14
recent recogn recommend record red redirect ref refer referenc reflect refrain
-9.184e+13 1.828e+14 -4.674e+14 -3.536e+14 1.022e+14 -8.586e+14 -3.742e+14 -8.851e+13 -3.001e+14 3.058e+14 3.929e+14
refus regard regardless region regist regular relat relationship releas relev reliabl
4.162e+14 -2.499e+14 -2.683e+14 -4.204e+14 -1.304e+13 7.421e+14 -2.631e+14 2.504e+14 1.032e+13 5.704e+13 9.945e+13
religi religion remain remark rememb remind remov renam `repeat` replac repli
1.941e+14 -2.011e+14 -1.646e+14 -1.671e+14 -1.683e+14 -3.748e+14 2.894e+12 -3.779e+14 8.871e+13 -1.315e+13 -1.450e+14
report repres reput request requir research resolv resourc respect respond
-8.366e+13 1.226e+14 -9.080e+13 -2.128e+14 5.565e+13 8.395e+13 -4.326e+14 -3.057e+14 -8.260e+13 -1.585e+14
[ reached getOption("max.print") -- omitted 409 entries ]
Degrees of Freedom: 29321 Total (i.e. Null); 27914 Residual
Null Deviance: 40650
Residual Deviance: 575200 AIC: 578000
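The coefficients are on the order of 1e14 and the residual deviance (575200) is far above the null deviance (40650), which is consistent with the convergence warnings: the fit is degenerate, so the results below should be read with caution.
Let’s run inference on the test set now.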
# Split the test set
X_test = df_test[,-1]
Y_test = df_test$df_toxic
# Do the inference on the test set
tic("Inference: ")
Y_pred <- predict(f,X_test,type='response')
prediction from a rank-deficient fit may be misleading
toc(log = TRUE)
Inference: : 7.35 sec elapsed
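The next chunk turns the predicted probabilities into class labels with `round()`, which amounts to a 0.5 cut-off. A small sketch on invented probabilities shows the equivalent explicit threshold; the variable names are hypothetical. The only difference is at exactly 0.5, because R rounds halves to the even digit.

```r
# Toy probabilities (invented for illustration)
p <- c(0.02, 0.49, 0.50, 0.51, 0.98)

# Assumption: 0.5 is the intended cut-off, as implied by round()
threshold <- 0.5

pred_round  <- round(p)                    # what the notebook does
pred_thresh <- as.integer(p >= threshold)  # explicit, adjustable threshold

print(data.frame(p, pred_round, pred_thresh))
```

An explicit threshold makes it easy to trade false positives against false negatives later by changing a single value.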
# Add the predicted class and compare it with the true labels
predictions = as.data.frame(Y_pred)
predictions$predictions = round(predictions$Y_pred)
predictions$real = Y_test
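# What does the confusion matrix give us?
# Note: caret's signature is confusionMatrix(data = predicted, reference = true).
# The call below passes the true labels first, so in the output the rows
# labelled "Prediction" are the true labels and the columns labelled
# "Reference" are the predicted labels.
writeLines("\n")
mat = confusionMatrix(as.factor(Y_test), as.factor(predictions$predictions))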
mat
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 2544 3239
1 245 5596
Accuracy : 0.7003
95% CI : (0.6919, 0.7086)
No Information Rate : 0.7601
P-Value [Acc > NIR] : 1
Kappa : 0.399
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.9122
Specificity : 0.6334
Pos Pred Value : 0.4399
Neg Pred Value : 0.9581
Prevalence : 0.2399
Detection Rate : 0.2189
Detection Prevalence : 0.4975
Balanced Accuracy : 0.7728
'Positive' Class : 0
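The statistics above can be recomputed by hand from the four counts, which also makes the roles of rows and columns explicit. In the call above the true labels are the first argument, so the rows of the printed matrix are the true labels (the row sums 5783 and 5841 match the test-set label counts). Under that reading, the value printed as "Sensitivity" is the precision of predicted class 0, and the value printed as "Pos Pred Value" is the recall of true class 0. A base R sketch with the counts copied from the output; the object names are hypothetical.

```r
# Counts copied from the printed matrix.
# Rows = first argument of confusionMatrix() (here: true labels),
# columns = second argument (here: predicted labels).
cm <- matrix(c(2544, 245, 3239, 5596), nrow = 2,
             dimnames = list(true = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)
recall    <- diag(cm) / rowSums(cm)  # per true class
precision <- diag(cm) / colSums(cm)  # per predicted class

print(cm)
print(round(accuracy, 4))
print(round(recall, 4))
print(round(precision, 4))
```

Read this way, the model recovers about 96% of the toxic comments (class 1) but only about 44% of the non-toxic ones (class 0): it labels most comments as toxic.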
print("END: the whole notebook ran.")
[1] "END: the whole notebook ran."
Sys.time()
[1] "2021-08-23 22:42:03 CEST"
writeLines(paste0("File: ", fileName))
File: bow_tfidf__min_words_100_2grams_1000__sampling_balanced__cor_cut_1_from_1408_to_1408.csv
writeLines(paste0("Accuracy: ", mat$overall[1]))
Accuracy: 0.700275292498279
writeLines(paste0(tic.log(format = TRUE)[1][1]))
Training: : 1392.45 sec elapsed
writeLines(paste0(tic.log(format = TRUE)[2][1]))
Inference: : 7.35 sec elapsed